Sections required in your report:

Load Ames Housing Data

Brief description of the data set and a summary of its attributes

The Ames Housing data set records the sale price of houses together with attributes of each house, such as lot area and year built. We will select the important features by first analyzing the missing values in each column.
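A minimal sketch of how the missing values could be summarized per column with pandas. The small `toy` frame, its values, its missing-value pattern and the 50% cutoff `MAX_MISSING` are all made up for illustration; they are not taken from the real data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Ames data: values and missing pattern are invented.
toy = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250, 9550],
    "Gr Liv Area": [1710, 1262, 1786, 1717],
    "Garage Cars": [np.nan, np.nan, 2, np.nan],
    "Full Bath": [2, np.nan, 2, 1],
})

# Share of missing values in each column (0.0 = complete, 1.0 = empty).
missing_share = toy.isna().mean()

# Hypothetical cutoff: keep columns with at most 50% missing values.
MAX_MISSING = 0.5
keep_cols = [c for c in toy.columns if missing_share[c] <= MAX_MISSING]

print(missing_share.sort_values(ascending=False))
print("Columns kept:", keep_cols)
```

On the real data the same two steps apply to the full DataFrame; the right cutoff depends on how many columns we can afford to drop.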

Initial plan for data exploration: Let's pick out just a few numeric columns to illustrate basic feature transformations.

From the above boxplot, it is observed that there are some outliers in Lot Area and Gr Liv Area. We need to take care of those. Actions taken for data cleaning and feature engineering: We filter out the rows where Gr Liv Area is above 4000.
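The filtering step can be sketched as follows. The `toy` frame and its values are invented for illustration, and keeping rows exactly equal to 4000 (`<=` rather than `<`) is an assumption about how the boundary is handled.

```python
import pandas as pd

# Toy stand-in for the Ames data: values are invented for illustration.
toy = pd.DataFrame({
    "Gr Liv Area": [1710, 1262, 4676, 2198, 5642],
    "Lot Area": [8450, 9600, 40094, 14260, 63887],
    "SalePrice": [208500, 181500, 184750, 250000, 160000],
})

# Keep only rows whose living area is at most 4000 (assumed boundary).
filtered = toy.loc[toy["Gr Liv Area"] <= 4000].reset_index(drop=True)

print(f"Rows before: {len(toy)}, rows after: {len(filtered)}")
```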

Pair plot of features

Now that we have a nice, filtered dataset, let's generate visuals to better understand the target and feature-target relationships: pairplot is great for this!

Key Findings and Insights, which synthesize the results of Exploratory Data Analysis in an insightful and actionable manner: From the above pair plot, it is observed that Overall Qual and Gr Liv Area are related to SalePrice.

Formulating at least 3 hypotheses about this data:

Hypothesis 1: Null Hypothesis - there is no difference between the means of Overall Cond and Overall Qual. Alternative Hypothesis - there is a difference between the means of Overall Cond and Overall Qual.

Conducting a formal significance test for Hypothesis 1 and discussing the results: We will obtain our test statistics, the t-value and the p-value. We will use the ttest_ind() function from the scipy.stats library to calculate them.
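A minimal sketch of the test for Hypothesis 1 with `scipy.stats.ttest_ind()`. The two rating lists are invented stand-ins for the Overall Cond and Overall Qual columns, and the 0.05 significance level `ALPHA` is an assumed, conventional choice rather than one stated in this report.

```python
from scipy import stats

# Invented stand-ins for the two columns (ratings on a 1-10 scale).
overall_cond = [5, 5, 6, 5, 7, 5, 5, 6]
overall_qual = [7, 6, 7, 8, 5, 8, 7, 6]

# Two-sample t-test on the means; the samples are treated as independent.
t_stat, p_value = stats.ttest_ind(overall_cond, overall_qual)

# Assumed significance level.
ALPHA = 0.05
reject_null = p_value < ALPHA

print(f"t-value: {t_stat:.3f}, p-value: {p_value:.4f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

If the p-value falls below the chosen level, we reject the null hypothesis of equal means. Since both ratings describe the same houses, a paired test such as `scipy.stats.ttest_rel()` may also be worth considering.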

Hypothesis 2: Null Hypothesis - there is no difference between the means of Lot Area and Gr Liv Area. Alternative Hypothesis - there is a difference between the means of Lot Area and Gr Liv Area.

Hypothesis 3: Null Hypothesis - there is no difference between the means of Full Bath and Garage Cars. Alternative Hypothesis - there is a difference between the means of Full Bath and Garage Cars.

Suggestions for next steps in analyzing this data: We can check the correlation between each feature and the target variable, as well as among the features themselves. We can then focus on the features most strongly correlated with the target.

From the graph above, we can identify the features most highly correlated with the target and select only those for any future analysis.
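The correlation check can be sketched numerically as well. The `toy` frame and its values are invented for illustration, and the 0.5 cutoff `THRESHOLD` is a hypothetical choice, not a value used in this report.

```python
import pandas as pd

# Toy stand-in for the Ames data: values are invented for illustration.
toy = pd.DataFrame({
    "Overall Qual": [5, 6, 7, 8, 9, 4],
    "Gr Liv Area": [1100, 1400, 1700, 2100, 2600, 900],
    "Lot Area": [9000, 7000, 12000, 8000, 10000, 11000],
    "SalePrice": [120000, 150000, 190000, 240000, 320000, 100000],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

# Hypothetical cutoff on the absolute correlation.
THRESHOLD = 0.5
selected = corr_with_target[corr_with_target.abs() >= THRESHOLD].index.tolist()

print(corr_with_target)
print("Selected features:", selected)
```

Using the absolute value keeps strongly negative correlations as well, since those features can be just as informative for a model.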

A paragraph that summarizes the quality of this data set and a request for additional data if needed: From the above observations, we have seen that only two features (Overall Qual, Gr Liv Area) are strongly related to the target variable SalePrice. It would be helpful to obtain additional data that is more closely related to the price, so that we can build a machine learning model that predicts the price more accurately.